Employee Attrition Modeling

This project is a demonstration of predictive modeling for employee attrition using machine learning. The dataset is a fictitious HR dataset originally found on Kaggle; the original source is no longer available.

This notebook walks through the end-to-end process of exploring a dataset, preprocessing features, and building and evaluating models. The libraries used in this project are Pandas, NumPy, Plotly, Seaborn, Matplotlib, Scikit-Learn, and Scikit-plot.

Note: If viewing this directly on GitHub, you will not see the Plotly graphs, because GitHub's static render does not execute the embedded JavaScript and HTML. For a full render, paste the GitHub URL of this notebook into https://nbviewer.jupyter.org. If that does not work, please download the HTML file from this repo, or clone the full repo if you wish to run the notebook locally.

Imports
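
The import cell itself is not reproduced here; a minimal sketch of the core imports, assuming standard package names (the Plotly, Seaborn, and Scikit-plot imports are shown as comments since they are only needed for the visualizations):

```python
# Core scientific stack used throughout the notebook.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Scikit-learn pieces used for preprocessing and modeling.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split

# Visualization extras would also be imported here, e.g.:
# import plotly.express as px
# import seaborn as sns
# import scikitplot as skplt
```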

Exploratory Data Analysis

Key observations:

  • Satisfaction levels for leavers are lower compared to active employees
  • Evaluation scores differ only marginally between the two groups
  • Leavers were far less likely to have received a promotion in the last 5 years
  • Leavers worked more hours on average
  • Leavers had spent more time with the company

    All of these observations make sense: performance did not differ significantly, yet those who left the organization received far fewer promotions, worked longer hours on average, and had been with the company longer than those who stayed. It is easy to imagine how this would contribute to lower satisfaction levels. From this we can infer that satisfaction level could be a very useful attribute for predicting employee churn, as dissatisfied employees are generally more likely to leave an organization than satisfied ones. A more in-depth analysis is needed, but first let's take a high-level look at all of the variables to understand their distributions.

    There are two categorical variables in our data: department and salary. We can look at how many people are leaving as well as the attrition percentage from each department and salary level.
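
As a sketch of how those counts and percentages can be computed (the column names `dept` and `left` are assumptions based on the usual Kaggle schema, and the tiny dataframe below is illustrative only):

```python
import pandas as pd

# Toy stand-in for the HR dataset; `left` is 1 for leavers, 0 for active.
df = pd.DataFrame({
    "dept": ["sales", "sales", "sales", "support", "support", "technical"],
    "left": [1, 1, 0, 1, 0, 0],
})

# Number of leavers per department.
leavers = df.groupby("dept")["left"].sum()

# Attrition percentage per department (the mean of a 0/1 column is a rate).
attrition_pct = df.groupby("dept")["left"].mean() * 100

print(leavers)
print(attrition_pct.round(1))
```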

    We can see that the Sales, Technical, and Support departments show the highest number of leavers by far. However, these are also the three largest departments in the organization, with significantly more active employees than the others, so they will always show more leavers through sheer size. We can instead look at the attrition percentage in each department to see whether people are leaving these departments at a relatively greater rate than the others.

    We can see that the relative attrition percentages of the Sales, Technical, and Support departments are not significantly higher than those of the other departments. In fact, all of the departments with the exception of Management and R&D are fairly comparable, indicating that no one department is suffering from a disproportionately high attrition rate (which we might see if some departments were problematic, for example through a toxic culture or poor leadership driving people away).

    Next we will look at the same with regards to salary level.

    We see the most leavers in the low salary level and the least in the high salary level, which is an expected observation. Let's also look at the relative attrition percentages within each level to get a sense of proportion.

    We can dig a little deeper and look at the salary breakdown within each department, this time focusing only on leavers.
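
A sketch of that department-by-salary breakdown for leavers using `pd.crosstab` (column names and the toy values are illustrative assumptions, not the real data):

```python
import pandas as pd

# Illustrative frame; `salary` takes the dataset's low/medium/high levels.
df = pd.DataFrame({
    "dept":   ["accounting", "accounting", "hr", "hr", "sales", "sales"],
    "salary": ["low", "medium", "medium", "medium", "low", "high"],
    "left":   [1, 1, 1, 0, 1, 0],
})

# Count leavers per department and salary level.
leaver_counts = pd.crosstab(
    df.loc[df["left"] == 1, "dept"],
    df.loc[df["left"] == 1, "salary"],
)
print(leaver_counts)
```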

    We can see from this that lower salary levels do not always show more leavers than higher levels: in Accounting, HR, and R&D there are more leavers in the medium salary level than in the low level. For leavers in the high salary level, the values are so small across all departments that Plotly cannot display the value labels at a readable size, so let us look at the values in the dataframe directly.

    Now let's look at the distributions of the numerical features to get a sense of where the bulk of the data sits, whether there is any skewness, and so on.
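
Alongside the histograms, skewness can also be checked numerically; a minimal sketch on synthetic data (the exponential sample below stands in for a right-skewed feature such as tenure, purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Skewed toy sample standing in for a numerical feature.
sample = pd.Series(rng.exponential(scale=2.0, size=1000))

print(sample.describe())           # location and spread
print("skewness:", sample.skew())  # > 0 indicates a right (positive) skew
```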

    Additional observations from the above plots: when we looked at the core statistics for the features, we saw that satisfaction scores were lower amongst leavers than active employees, but that last evaluation scores did not differ significantly between the groups. In the histogram and boxplot above for the last evaluation score, however, we see that the leavers' distribution is bimodal, with density concentrated at both the lower and upper ends. This makes sense: poor performers are likely to be let go or to resign due to lack of fit, while high performers who are dissatisfied tend to leave more than their satisfied counterparts, especially given that leavers have been with the company longer, work longer hours, and are much less likely to have received a promotion in the last 5 years.

    With this, let's explore the data further to see if we can cluster the leavers according to their satisfaction level and last evaluation scores.
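
The clusters a KDE plot suggests can also be recovered algorithmically, for example with k-means; the sketch below uses synthetic (satisfaction, last evaluation) pairs whose centres are assumptions mimicking the three groups described below, not the real data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic leavers around three assumed (satisfaction, last_evaluation) centres.
centres = np.array([
    [0.8, 0.9],   # satisfied high performers
    [0.1, 0.9],   # dissatisfied high performers
    [0.4, 0.5],   # somewhat dissatisfied low performers
])
points = np.vstack([c + rng.normal(0, 0.03, size=(100, 2)) for c in centres])

# Fit k-means with 3 clusters; well-separated groups are recovered cleanly.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("cluster centres:\n", km.cluster_centers_.round(2))
```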

    The KDE plot reveals 3 clusters when looking at satisfaction level and last evaluation:

  • High performers who are highly satisfied on the job
  • High performers who are dissatisfied on the job
  • Low performers who are somewhat dissatisfied on the job

    The strongest density is amongst the group of low performers who are somewhat dissatisfied on the job. These employees are neither productive nor engaged, so their withdrawal is ultimately a good thing for the organization.

    The concern for the organization is the high performers who are leaving, whether they are dissatisfied or not. While it is expected that dissatisfied high performers will leave, an organization should strive to understand why they were dissatisfied in the first place. It is equally important to understand why high performers who are satisfied on the job leave, since happy employees are generally not expected to go. To get an indication of what factors could be at play, let's look at a correlation heatmap to see if there are any correlations we should focus on.
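
The matrix behind such a heatmap comes from `DataFrame.corr`; a minimal sketch on synthetic data (column names are assumptions based on the dataset, and the hours column is deliberately generated to track the project count):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Toy numeric frame; hours are built to correlate with project count.
projects = rng.integers(1, 7, size=n)
hours = 120 + 25 * projects + rng.normal(0, 10, size=n)
df = pd.DataFrame({
    "number_project": projects,
    "average_monthly_hours": hours,
    "satisfaction_level": rng.uniform(0, 1, size=n),
})

# Pairwise Pearson correlations; this is the input to a heatmap.
corr = df.corr()
print(corr.round(2))
```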

    The heatmap shows us a few appreciable correlations and observations:

  • Satisfaction level has a negative correlation with attrition which makes sense, as higher satisfaction generally means lower risk of attrition
  • Last evaluation score is positively correlated with number of projects and average monthly hours; it could be that spending more time at work and carrying more projects partially leads to a more positive evaluation, simply because the employee at least appears to be working hard
  • Number of projects and average monthly hours are positively correlated, which makes sense: the more projects you have, the more work you have, and thus the longer you must spend at work in a given month. While these may seem like redundant features, they are not necessarily so: an employee could be on 1 or 0 projects yet spend a lot of overtime at work, which could indicate workplace inefficiencies
  • We already saw in the initial data understanding that mean satisfaction is higher for active employees than for leavers, but the satisfaction boxplots by department show that the IQRs for active employees are very similar across departments, whereas the IQRs for leavers vary much more. So why does satisfaction vary so much amongst leavers? Let's dig further.

    We can see that leavers in most departments were working quite a few more hours a month on average. Combined with the other observations we've seen for leavers, this could easily factor into lower satisfaction scores across the board.
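
That department-level comparison of average hours can be sketched with a pivot table (the column names and toy values below are illustrative assumptions):

```python
import pandas as pd

# Illustrative frame; `average_monthly_hours` is an assumed column name.
df = pd.DataFrame({
    "dept": ["sales", "sales", "support", "support"],
    "left": [0, 1, 0, 1],
    "average_monthly_hours": [180, 250, 170, 240],
})

# Mean monthly hours per department, split by active (0) vs leaver (1).
hours = df.pivot_table(
    index="dept", columns="left",
    values="average_monthly_hours", aggfunc="mean",
)
print(hours)
```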

    So now we have a decent understanding of the factors contributing to attrition: lower satisfaction, higher average monthly working hours, and longer tenure with a much lower likelihood of having received a promotion in the last 5 years.

    There are two factors we did not investigate in depth: number of projects and work accident. Number of projects was shown to be correlated with average monthly working hours and by itself wouldn't necessarily be of value since in some cases employees could have a low number of projects but high average working hours. As such, this is a secondary factor that is not expected to be a primary driver. Work accident is also considered to be a secondary factor because work accidents severe enough to cause attrition would be due to reasons separate from the employee's satisfaction on the job.

    Now we can move on to build some machine learning models to try and predict who will leave the organization.

    Preprocessing & Model Pipelining

    We will build our preprocessing and modeling into pipelines in this section.

    For preprocessing, we first need to encode the categorical variables into numerical features. There are two categorical features, dept and salary: we will use one-hot encoding for dept and ordinal encoding for salary. Additionally, we will use MinMaxScaler to normalize the features.

    First let's start with department where we will use the Scikit-Learn OneHotEncoder rather than the Pandas get_dummies method. The reasons for this choice are:

  • Preprocessing outside of Scikit-Learn can make cross-validation scores less reliable, since an in-pipeline encoder is fit within each fold rather than on the full dataset
  • OneHotEncoder does not add the encoded columns into the dataframe, which keeps the dataframe smaller and easier to manage
  • Using get_dummies would require re-running it for out-of-sample data (e.g. data with new categories)
  • OneHotEncoder allows us to run a grid search over preprocessing and modeling parameters together
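
A sketch of what this preprocessing stage can look like as a pipeline (the column names and the low < medium < high salary ordering are assumptions; `sparse_threshold=0` forces dense output so MinMaxScaler can follow the encoders):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, OrdinalEncoder

# Tiny illustrative frame with both categorical features and one numeric one.
df = pd.DataFrame({
    "dept": ["sales", "technical", "support", "sales"],
    "salary": ["low", "medium", "high", "low"],
    "average_monthly_hours": [150, 260, 200, 180],
})

encode = ColumnTransformer(
    transformers=[
        ("dept", OneHotEncoder(handle_unknown="ignore"), ["dept"]),
        ("salary", OrdinalEncoder(categories=[["low", "medium", "high"]]), ["salary"]),
    ],
    remainder="passthrough",  # numeric columns pass through untouched
    sparse_threshold=0,       # dense output so MinMaxScaler can follow
)

preprocess = Pipeline([
    ("encode", encode),
    ("scale", MinMaxScaler()),  # normalize everything to [0, 1]
])

X = preprocess.fit_transform(df)
print(X.shape)  # 3 one-hot dept columns + 1 ordinal salary + 1 numeric
```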
    Now it's time to move on to the modeling. We will try a few different models and compare them to see which is best.

    Gaussian Naive-Bayes

    Now let's see which employees the model correctly predicted would leave.
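
One way to see who was predicted correctly is a confusion matrix; a minimal sketch with toy labels (the notebook itself would pass the model's test-set predictions rather than these hand-made arrays):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions; 1 = left, 0 = stayed.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])

# Rows are actual classes, columns are predicted: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)
```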

    Overall the scores are not terrible, but not great either. With an accuracy of only 74%, it is unlikely that grid-search hyperparameter tuning could improve this model enough, so we should investigate another model.

    Logistic Regression

    This model is not much better than the Gaussian Naive-Bayes, so it is also unlikely that grid-search hyperparameter tuning would improve it much. We continue to investigate other models.

    K-Nearest Neighbours

    This is a good sign and is already significantly better than the Gaussian Naive-Bayes and Logistic Regression models.

    This model looks quite good already without any hyperparameter tuning! Let's investigate another model to see if we can do even better.

    Decision Tree Classifier

    This model is even better than the K-Nearest Neighbours!
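
The four-model comparison can be sketched as a loop over estimators; the synthetic data below stands in for the preprocessed HR feature matrix, so the accuracies it produces are for the toy data only, not the 74–98% figures reported in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed HR feature matrix.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

models = {
    "Gaussian Naive-Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbours": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# 5-fold cross-validated accuracy for each model, all with default parameters.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {scores[name]:.3f}")
```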

    Conclusion & Final Remarks

    Of our models, the Decision Tree Classifier is the most accurate at 98%, followed by K-Nearest Neighbours at 95%, Logistic Regression at 76%, and Gaussian Naive-Bayes at 74%. All models were run with default parameters; hyperparameter tuning was not pursued in this use case as it was not deemed necessary.

    For the Logistic Regression and Gaussian Naive-Bayes models, the accuracy was low enough that any improvement from hyperparameter tuning would still not match the other models, which are already much more accurate with default parameters. For K-Nearest Neighbours, while it already shows a high accuracy, we were able to find an even better model in the Decision Tree Classifier. As the Decision Tree Classifier shows near-perfect accuracy with default parameters, the computational cost of an exhaustive grid search would be too high for the minimal performance gains it could still deliver.

    With that said, there are some caveats to be mindful of. The dataset used has a very small feature set and is not fully representative of the feature diversity and quantity that we would see in an HRIS production environment. The dataset is very clean with no missing values which is also not representative of a production environment where incomplete and inaccurate data are day-to-day realities. As such, our models are fairly simple and have the benefit of perfect data which will inherently yield higher accuracy in general.

    However, simpler models can perform very well when data quality is high and the sample size is sufficient, and the dataset used for this project is certainly of sufficient size at 15,000 rows. This warrants some confidence in our results as well.